Abstract

By observing trends in the folding kinetics of experimental 2-state proteins at their transition
midpoints, and by observing trends in the barrier heights of numerous simulations of coarse-grained,
$C_\alpha$ model Gō proteins, we show that folding rates correlate with the degree of heterogeneity in
the formation of native contacts. Statistically significant correlations are observed between folding
rates and measures of heterogeneity inherent in the native topology, as well as between rates and
the variance in the distribution of either experimentally measured or simulated $\phi$-values.

Protein folding is a relaxation process driven by a first-order-like fluctuation of a critical
nucleus [1]. Because proteins are evolutionarily designed to fold to a particular structure,
frustrating interactions are minimized and the folding process can be projected onto one or
a few reaction coordinates without too much loss of information [2]. This projection yields
a free energy surface whose structure is subject to much interest. Different proteins have
different free energy surfaces with different barrier heights.

What factors determine the height of the folding free energy barrier for the various proteins?
As one would expect, the barrier decreases as the energetic stability of the folded
structure increases [3]. Moreover, folding rates tend to increase with energetic discrimination
measures between the folded state and unfolded or misfolded decoys [4]. As one might also
expect, the barrier increases for native structures that have longer polymer loops formed
during folding. A property capturing this effect, dubbed absolute contact order (ACO),
measures the mean sequence separation between amino acids in close proximity (and thus
interacting) in the native structure [5]:
$\mathrm{ACO} \equiv \bar{\ell} = (1/M) \sum_{i<j} |i-j|\,\Delta_{ij}$,
where $i$ and $j$ label amino acid index, $\Delta_{ij} = 1$ (or 0) if amino acids $i$ and $j$ are
(or are not) interacting in the native structure, and $M$ is the total number of contacts in the
native structure, determined by either heavy side chain atoms or $C_\alpha$ atoms within a cut-off
distance of 4.8 Å [6].
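
As a minimal sketch of the absolute contact order defined above, the snippet below averages the sequence separation over a list of native contacts. The function name `absolute_contact_order` and the toy contact list are hypothetical and chosen only for illustration; building the contact list from a real structure (the distance cut-off step) is not shown.

```python
def absolute_contact_order(contacts):
    """Mean sequence separation |i - j| over a list of native contacts (i, j)."""
    if not contacts:
        raise ValueError("need at least one native contact")
    return sum(abs(i - j) for i, j in contacts) / len(contacts)


# Hypothetical toy contact list (residue index pairs), not taken from any real protein.
toy_contacts = [(1, 4), (2, 9), (3, 15), (10, 14)]
aco = absolute_contact_order(toy_contacts)
print(aco)  # separations 3, 7, 12, 4 give a mean of 6.5
```

Here the number of contacts M is simply the length of the list, so each contact is assumed to be listed once.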

In what follows we first re-examine the trend of rates with $\bar{\ell}$ in light of theoretical
predictions [7, 8, 9], then we will go on to further examine higher-order aspects of native topology
(and energetics) that act as predictors of folding rate.

If we take data that first corrects for the effects of differing native stabilities for different
proteins by adjusting denaturant concentration to conditions at the transition midpoint,
and then plot the log folding rate vs $\bar{\ell}$, we find a statistically significant correlation for a
representative set of 19 2-state proteins (and the P13-14 circular permutant of S6) (Fig. 1A) [10].
Observations similar to this led the folding community to accept the idea that properties
of native topology strongly determine folding rate [?]. Moreover, if one simulates off-lattice
$C_\alpha$ Gō models [11] for 18 structures of known 2-state folders [12], one also finds a statistically
significant correlation between barrier height and absolute contact order (Fig. 1B). One also
notices from Fig. 1 that there must be more to the story than absolute contact order in
determining folding rates, since the fluctuations around the best fit line are significant.
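
To make the notion of a "statistically significant correlation" concrete, here is a minimal sketch that computes a Pearson correlation coefficient, Kendall's tau, and a two-sided significance estimate. The significance is estimated with a permutation test, which is a substitute chosen here for simplicity and is not claimed to be the procedure used for the figures; the data points are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                score += 1
            elif prod < 0:
                score -= 1
    return score / (n * (n - 1) / 2)


def permutation_p_value(x, y, stat=pearson_r, n_perm=2000, seed=0):
    """Two-sided estimate: fraction of shuffles with |stat| >= the observed |stat|."""
    rng = random.Random(seed)
    observed = abs(stat(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(stat(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: contact order vs. log folding rate (illustrative values only).
contact_order = [5, 8, 10, 12, 15, 18, 20]
log_rate = [3.1, 2.0, 2.4, 1.0, 0.2, -0.5, -1.8]

r = pearson_r(contact_order, log_rate)
tau = kendall_tau(contact_order, log_rate)
p = permutation_p_value(contact_order, log_rate)
print(r, tau, p)
```

A dependence would be called significant in the sense used here when the estimated probability falls below 0.05.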

The effects of native topology (and energetics) should be describable analytically as well.
To this end a free energy functional approach was developed [7, 8, 9], within which it was
FIG. 1: (A) Logarithm of experimental folding rate (in sec$^{-1}$) at the transition midpoint vs
absolute contact order, or mean sequence separation between interacting residues in the native
structure, $\bar{\ell}$. Wild type protein S6 is shown by an open square and the P13-14 circular
permutant of S6 is shown by an open circle. (B) The equivalent measure in Gō simulations is
$-\Delta F^{\ddagger}/T_f$, again plotted vs $\bar{\ell}$. Both show a statistically significant
anti-correlation: $r$ (or $\tau$) is the correlation coefficient (or Kendall's tau). Statistical
significance is defined here by the probability $P(r)$ ($P(\tau)$) to observe a given correlation
coefficient or greater by chance. If $P(r)$ ($P(\tau)$) < 0.05, the dependence is typically deemed
statistically significant [14]. Shown in (A) are 19 proteins (and the P13-14 circular permutant of
S6) for which experimental rate data are available at various denaturant concentrations [10], and
in (B) 18 simulated Gō model proteins [12].

shown that the free energy barrier may be written in terms of an expansion involving moments
of distributions of native contact interaction energies $\{\epsilon_{ij}\}$, and native contact
sequence separations $\{\ell_{ij}\} = \{|i-j|\}$. The lowest order corrections to the mean-field
barrier are

(1) 
where $A$, $B$, $C$ are all positive and of order unity. The lowest order mean field term
$\Delta F^{\ddagger}_{MF} = \Delta F^{\ddagger}(\bar{\epsilon}, \bar{\ell})$, where $\bar{\epsilon}$,
$\bar{\ell}$ are the first moments (means) of the distributions, indeed increases as
$\bar{\ell}$ increases, consistent with the observed trend. The theory gives the slope $m_{MF}$ of
the mean field barrier vs $\bar{\ell}$ as [9]

,rT.^ /o« in /oN/.^/^2, , ,-.1/2 



= di-AF'/T) /di ^ -{3/2){M/i ) ln(£ ' /2). (2) 

Calculating Eq. (2) for all proteins used in Fig. 1A, $m_{MF} = -0.41 \pm 0.09$, which is
consistent with the slope of the best fit line, $-0.36$. The mean field slope for the proteins in
Fig. 1B is $-0.42 \pm 0.08$, which is roughly twice the slope of the best fit line, $-0.19$. There may be
several reasons for this, including the fact that the theory used the mean field approximation,
while the nucleus may be better approximated by a capillary model, and the Gaussian
approximation for polymer loops used in the theory may be poor for many contacts. There
may also be a cancellation of errors in Fig. 1A due to the presence of a capillary nucleus with
many-body interactions present, which would result in unexpectedly good agreement.

Second order terms in Eq. (1) involving the fluctuations of native energies and loop lengths
contact to contact all tend to decrease the barrier, leading to the notion that proteins with
more heterogeneous folding mechanisms should fold faster [?]. We note that here a more
heterogeneous folding mechanism corresponds to a more specific, polarized folding nucleus,
i.e. the heterogeneity here refers to contact formation probability, not conformational
diversity of the transition state. Earlier lattice-simulation studies [16] as well as more recent
experimental studies of circular permutants [13] support the notion that a more polarized
nucleus results in a faster folding protein.

We can readily check if the second moment of the loop length distribution has an observable
effect on rates, even if we ignore variations due to different $\bar{\ell}$ values protein to protein,
as well as the terms with coefficients $A$ and $B$ in Eq. (1). The functional theory gives the
coefficient $C$ in Eq. (1) [9], so the change in barrier height due to the presence of structural
variance is:

$(\Delta F^{\ddagger} - \Delta F^{\ddagger}_{MF})/MT = \delta\Delta F^{\ddagger}/MT \approx -Q^{\ddagger}\,\delta\ell^2/\bar{\ell}^2$. (3)

Here, $Q$ is the overall fraction of native contacts, and $Q^{\ddagger}$ is the value of $Q$ at the
barrier peak.
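
The structural heterogeneity measure entering Eq. (3) can be sketched as follows, assuming Eq. (3) reads $\delta\Delta F^{\ddagger}/MT \approx -Q^{\ddagger}\,\delta\ell^2/\bar{\ell}^2$, with $\delta\ell^2$ the variance of the native contact sequence separations. The function names, the use of the population variance, the toy contact list and the value of `q_barrier` are all assumptions made for illustration.

```python
def loop_length_stats(contacts):
    """Return (mean, population variance) of |i - j| over native contacts."""
    seps = [abs(i - j) for i, j in contacts]
    m = len(seps)
    mean = sum(seps) / m
    var = sum((s - mean) ** 2 for s in seps) / m
    return mean, var


def structural_barrier_shift(contacts, q_barrier):
    """Estimated change in barrier per contact, in units of T, from loop-length variance."""
    mean, var = loop_length_stats(contacts)
    return -q_barrier * var / mean ** 2


# Hypothetical toy contacts and an assumed barrier position Q = 0.5.
toy_contacts = [(1, 4), (2, 9), (3, 15), (10, 14)]
mean_sep, var_sep = loop_length_stats(toy_contacts)
shift = structural_barrier_shift(toy_contacts, q_barrier=0.5)
print(mean_sep, var_sep, shift)
```

A larger spread of loop lengths at fixed mean gives a more negative shift, i.e. a lower barrier, which is the direction of the trend discussed in the text.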

Plots of experimental log folding rate and simulated barrier heights (over $MT$) both show
statistically significant correlation with $\delta\ell^2/\bar{\ell}^2$ (Fig. 2).
FIG. 2: Plotted in (A) are log experimental rate data (at the transition midpoints) and in (B),
simulated barriers (at $T_f$), as a function of the measure of structural heterogeneity that appears
in the functional theory in Eqs. (1) and (3). Both show a moderate, but statistically significant
correlation with structural variance. Three α/β proteins (λ-repressor chain 3, cytochrome c, yeast
iso-1-cytochrome c) tend to have both large structural variance and fast folding rates.

However, there are large fluctuations present, and the slope of the best fit line is only about
a tenth of the theoretical prediction. Neglecting trends due to contact order and energetic
variance introduces errors in the plots.


Experimentally measured $\phi$-values [17] involve both energetics and entropies and should
better capture the effects of heterogeneity in folding mechanism. The variance in $\phi$-values
couples together the last 3 terms in Eq. (1). To facilitate a comparison of rates with
$\phi$-variance, the free energy barrier may be recast in terms of the variance in native contact
formation probabilities $\{Q_{ij}\}$ [9]:

$\delta\Delta F^{\ddagger}/MT \approx -\delta Q^2/2Q^{\ddagger}$. (4)

Eq. (4) only includes the effects of heterogeneity in polymer loop length; however, energetic
heterogeneity can be incorporated as well, which only changes the coefficient $(1/2Q^{\ddagger})$ in
FIG. 3: Plots of log experimental folding rate (over $M$) for a subset of proteins in Fig. 1 for
which experimental $\phi$-values are available, and minus free energy barrier (over $MT$) for simulated
proteins, vs $\phi$-variance. Both show strong statistically significant correlation. In particular the
trend in experimental data is strong even though the number of proteins with available data for
both $\phi$-variance and transition midpoint rate is not large. Experimental data for wild type S6 is
shown by an open square, and the P13-14 circular permutant of S6 [13] is shown by an open circle, which
fits very well to the rest of the data and increases correlation. The strong correlation remains upon
dividing by chain length instead of total number of contacts $M$.



Eq. (4) to $(3/2Q^{\ddagger})$. The simulations have no variance in native contact energies; moreover,
statistical arguments suggest that this native variance may be significantly reduced with
respect to the variance in collapsed random structures.
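
As a rough sketch of how Eq. (4) could be evaluated, the function below takes a list of transition-state contact formation probabilities and returns the estimated barrier shift, assuming Eq. (4) reads $\delta\Delta F^{\ddagger}/MT \approx -\delta Q^2/2Q^{\ddagger}$ and that the coefficient becomes $3/2Q^{\ddagger}$ when energetic heterogeneity is included. Taking $Q^{\ddagger}$ as the mean of the listed probabilities, the function name, and the toy values are assumptions for illustration only.

```python
def contact_variance_shift(q_ts, energetic_heterogeneity=False):
    """Estimated barrier shift per contact (units of T) from variance in Q_ij."""
    n = len(q_ts)
    q_barrier = sum(q_ts) / n  # assumed: Q at the barrier is the mean of Q_ij
    var_q = sum((q - q_barrier) ** 2 for q in q_ts) / n
    coefficient = 1.5 if energetic_heterogeneity else 0.5
    return -coefficient * var_q / q_barrier


# Hypothetical transition-state contact probabilities (a polarized nucleus).
toy_q = [0.9, 0.1, 0.6, 0.4]
loops_only = contact_variance_shift(toy_q)
with_energetics = contact_variance_shift(toy_q, energetic_heterogeneity=True)
print(loops_only, with_energetics)
```

A uniform nucleus (all probabilities equal) gives zero shift, while a more polarized one lowers the estimated barrier.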
$\phi$-values may be defined analytically as [15, 18]



(5) 



where $Q^{u}_{ij}$, $Q^{\ddagger}_{ij}$ and $Q^{f}_{ij}$ are the probabilities of native contact
formation between residues $i$ and $j$ in the unfolded, transition and folded states respectively.

In the approximation that all contacts are fully formed in the native structure ($Q^{f} = 1$),
and unformed in the unfolded structures ($Q^{u} = 0$), the $\phi$-value for residue $i$ is the mean of
$Q_{ij}$ values in the transition state (cf. Eq. (5)). Further approximating the same number of
nearest neighbors $z$ for all residues, the variances are related by
$\delta\phi^2 \approx (1/z)\,\delta Q^2$. If we make
no approximations and simply plot $\delta Q^2$ vs. $\delta\phi^2$ (for the simulation data), the quantities
correlate extremely well (see Table I) with a slope of ~1.2 and an intercept of -0.04. The
intercept may be non-zero since other fluctuating quantities (e.g. Q^, z) contribute to the 
variance of $\phi$-values.
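
The idealized relation between the two variances can be illustrated numerically. The sketch below assumes that each residue has exactly `z` contacts whose transition-state probabilities are independent draws, takes the $\phi$-value of a residue as the mean of its contact probabilities, and compares the variance of the contact probabilities with the variance of the $\phi$-values. The independence assumption, the uniform distribution and all names are illustrative; real contact maps share each contact between two residues, so the idealized factor need not hold for actual data.

```python
import random


def variance(values):
    """Population variance of a sequence of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def phi_values(q_ts):
    """phi_i = mean over the contacts of residue i of Q_ij at the transition state."""
    return [sum(row) / len(row) for row in q_ts]


rng = random.Random(0)  # fixed seed so the run is reproducible
z = 6                   # assumed number of contacts per residue
n_residues = 5000       # large sample, purely to reduce statistical noise

# Independent, uniformly distributed contact probabilities (an idealization).
q_ts = [[rng.random() for _ in range(z)] for _ in range(n_residues)]
phi = phi_values(q_ts)

var_q = variance([q for row in q_ts for q in row])
var_phi = variance(phi)
print(var_q / var_phi)  # close to z under the independence assumption
```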

The above arguments indicate $\delta Q^2$ and $\delta\phi^2$ are within a factor of approximately unity,
so we rewrite Eq. (4) in the form

$\delta\Delta F^{\ddagger}/MT \approx -D\,\delta\phi^2$ (6)

with $D$ a parameter of order unity.

According to Eq. (6), more polarized nuclei have lower free energy barriers. Plots of
$-\Delta F^{\ddagger}/MT$ vs $\delta\phi^2$ for experiments and simulations are shown in Fig. 3. Here we see a strong
statistically significant correlation of both rates and barriers with variance. Moreover, the
slopes of the best fit lines (~0.3) compare somewhat more favorably with the theoretically
predicted values (~0.8) than was the case for structural variance. A precise comparison with
experimental data is more difficult since the coordination number $z$ as well as the numbers
$Q^{u}$ and $Q^{f}$ are not accurately known for all proteins. Taking the slope from Fig. 3 and
using the approximations mentioned above allows us to infer the residue-residue coordination
number: z A ii energetic heterogeneity is negligible (Eq. Q), 2; 11 if it is substantial 
(Eq. (4) with coefficient $3/2Q^{\ddagger}$).

The residuals of $-\Delta F^{\ddagger}/MT$ vs $\bar{\ell}$, when plotted against $\delta\ell^2/\bar{\ell}^2$ and $\delta\phi^2$, show comparable
but typically slightly less significant correlation (within 10%) to those in Fig. 2 and Fig. 3.
The term $\delta\Delta F^{\ddagger}/MT$ can be thought of as a measure of these residuals. We have plotted
absolute rates, which are easily measurable from experiments or simulations, while the
mean-field barrier is not.
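
The residual analysis described above can be sketched as an ordinary least-squares line fit followed by subtraction of the fitted trend. The helper names and the data are hypothetical; the residuals returned here would then be correlated against a heterogeneity measure such as the structural or $\phi$-variance.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def residuals(x, y):
    """Deviation of each y from the best-fit line through (x, y)."""
    slope, intercept = linear_fit(x, y)
    return [b - (slope * a + intercept) for a, b in zip(x, y)]


# Hypothetical data: contact order vs. a barrier-like quantity.
contact_order = [5.0, 8.0, 10.0, 12.0, 15.0]
minus_barrier = [-0.8, -1.9, -1.6, -2.9, -3.1]

res = residuals(contact_order, minus_barrier)
print(linear_fit(contact_order, minus_barrier), res)
```

By construction the residuals of a least-squares fit sum to zero, so any remaining trend against a second variable reflects information not captured by the first.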

We note that $\delta\phi^2$ has errors due both to experimental measurement and to the small
set of $\phi$-values for each protein. Moreover, the experimental rates at the transition midpoint
are compared to the variance in $\phi$'s typically measured in water or stabilizing conditions.
Interestingly, experimental folding mechanisms tend to be more polarized than uniform Gō
models.

In the case of the simulations, the correlation between $\delta\phi^2$ and $\delta\ell^2/\bar{\ell}^2$ is strong as expected,
since there is no variance in native contact energies, by construction of the model. For
experimental data, however, the correlation is poor, which implies that there may be substantial
energetic heterogeneity present in native contact energies of real proteins. It is not too
surprising then that there is no correlation between the variance of experimental $\phi$-values
and simulation $\phi$-values (see Table I). In the analysis then, simulated barriers were plotted
against simulated $\phi$-variance, and experimental rates were plotted against experimental
$\phi$-variance.

We did not find any significant correlation between rates and structural variance $\delta\ell^2/\bar{\ell}^2$
for 3-state folders. Here there is the intriguing picture that (on-pathway) intermediates in
3-state folders are in fact induced by structural or energetic heterogeneity, so that there is
no a priori reason for folding rates to continue to increase with increasing heterogeneity.

S6 displays significant correlation between native contact energies and native loop
lengths [15]. For this reason we did not include it in Fig. 2A, which only includes a
structural measure of heterogeneity; if it is included the correlation decreases to r = 0.57,
P(r) = 9.6 X 10~^. We note that the inclusion of the two data points corresponding to S6 
does not change the correlation in Fig. and decreases the correlation in Fig. by 8%. 

We showed here that both experimental rates and simulated free energy barriers for 2-state
proteins depend on the degree of heterogeneity present in the folding process. The
results compared quite well with the predictions of the free energy functional theory [9,
15]. Heterogeneity due to variance in the distribution of native loop lengths, as well as
variance in the distribution of $\phi$-values, were both seen to increase folding rates and reduce
folding barriers. The observed effect due to $\phi$-variance was the most statistically significant (as
expected), because $\phi$-variance captures both heterogeneity arising from native topology as
well as that arising from energetics.

TABLE I: Correlation coefficient and statistical significance for various quantities.

^a 2-sided statistical significance has been used.
^b Here we divide by the number of native contacts $M$. Dividing instead by chain length $N$ gives
correlations within 10%. $M$ and $N$ correlate very strongly (r = 0.94).
^c Data from both simulated and experimental proteins used.